Fixed-rank matrix factorizations and 



c/3 



Riemannian low-rank optimization* 

B. Mishrat G. Meyer''' S. Bonnabel* R. Sepulchre''" 

September 4, 2012 

Abstract 

Motivated by the problem of learning a linear regression model whose parameter is 
a large hxed-rank non-symmetric matrix, we consider the optimization of a smooth cost 
function defined on the set of fixed-rank matrices. We adopt the geometric optimization 
framework of optimization on Riemannian matrix manifolds. We study the underlying 
geometries of several well-known fixed-rank matrix factorizations and then exploit the 
Riemannian geometry of the search space in the design of a class of gradient descent and 
trust-region algorithms. The proposed algorithms generalize our previous results on fixed- 
rank symmetric positive semidefinite matrices, apply to a broad range of applications, 
scale to high-dimensional problems and confer a geometric basis to recent contributions 
on the learning of fixed-rank non-symmetric matrices. We make connections with existing 
algorithms in the context of low-rank matrix completion and discuss relative usefulness 
of the proposed framework. Numerical experiments suggest that the proposed algorithms 
compete with the state-of-the-art and that manifold optimization offers an effective and 
versatile framework for the design of machine learning algorithms that learn a fixed-rank 
matrix. 

1 Introduction 



The problem of learning a low-rank matrix is a fundamental problem arising in many mod- 
ern machine learning applications such as collaborative filtering jiRS05], classification with 
multiple classes [AFSU07], learning on pairs [ABEV09], dimensionality reduction [CHH07J, 
learning of low-rank distances [KSD09, MB S lib] and low-rank similarity measures [S WClOj . 
multi-task learning |EMP05l MMBS11], just to name a few. Parallel to the development of 
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these new applications, the ever-growing size and number of large-scale datasets demands ma- 
chine learning algorithms that can cope with large matrices. Scalability to high-dimensional 
problems is therefore a crucial issue in the design of algorithms that learn a low-rank matrix. 
Motivated by the above applications, the paper focuses on the following optimization problem 

min We jr (r . )dlid2) /(W), (1) 

where / : R dlXrf2 — y R is a smooth cost function and the search space is the set of fixed-rank 
non-symmetric real matrices, 

J r (r,d 1 ,d 2 ) = {W G M dlXa!2 |rank(W) = r}. 

We show in Section [2] that the considered optimization problem (TjQ) encompasses various 
modern machine learning applications. We tackle problem ([1]) in a Riemannian framework, 
that is, by solving an unconstrained optimization on a Riemannian manifold in bijection with 
the nonlinear space T[r, d\, cfo)- 

The paper follows and builds upon a number of recent contributions in that direction: 
the Ph.D. thesis |Meyll| and several papers by the authors [MBSllal IMBSllbl IMMBS111 
MMS11, Jou09]. The main contribution of this paper is to emphasize the common framework 
that underlines those contributions, with the aim of illustrating the versatile framework of 
Riemannian optimization for constrained optimization. Necessary ingredients to perform both 
first-order and second-order optimization are listed. An attempt is also made to classify the 
existing algorithms into various geometries and show the common structure that connects 
them all. Scalability to large dimension problems is shown in numerical examples. The 
generic performance of all algorithms is similar but strong gains can be obtained by adapting 
geometries and parameterizations to specific problems, in the same way as different families 
of first-order or second-order optimization algorithms adapt to specific problems. 

2 Motivation and applications 

In this section, a number of modern machine learning applications are cast as an optimization 
problem on the set of fixed-rank non-symmetric matrices. 

2.1 Low-rank matrix completion 

The problem of low-rank matrix completion amounts to estimating the missing entries of a 
matrix from a very limited number of its entries. There has been a large number of research 
contributions on this subject over the last few years, addressing the problem both from a 
theoretical [CR081 IGrollj and from algorithmic point of view [RS051 ICCS081 lLB09l IMJD091 
IKMO101 ISElUl IJMD101 IMHT101 [BAT1] . An important and popular application of the low- 
rank matrix completion problem is collaborative filtering [RS05, ABEV09J. 

Let W* £ R d i xd 2 b e a matrix whose entries W*- are only given for some indices € $7, 
where Q is a subset of the complete set of indices {(i,j) : i 6 {1, ...,^1} and j S {1, ...,^2}}. 



Low-rank matrix completion amounts to solving the following optimization problem 

min WeMdlXd2 ^\\ Vn (W)-Vn(W*)\\ 2 F 
subject to rank(W) = r, 

where the function Psj(Wjj) = Wy if (i 3 j) € O and Pn(Wjj) = otherwise and the norm 
|| • \\f is Frobenius norm. The function Vq. is also called the orthogonal sampling operator and 
\£l\ is the cardinality of the set (equal to the number of known entries). 

The rank constraint captures redundant patterns in W* and ties the known and unknown 
entries together. The number of given entries is typically much smaller than d\d 2 , the 
total number of entries in W*. Recent contributions provide conditions on |0| under which 
exact reconstruction is possible from entries sampled uniformly and at random [CR08, CCS08, 
IKMUlOj . 

2.2 Learning on data pairs 

Given two types of data x € X (say M. dl ) and z£2 (say IR^ 2 ) associated with two types of 
samples and also given are the associated scalar observations y € M, learning on data pairs 
amounts to learning a predictive model y : X x Z M from training examples {(x,, Zj, yi)}™ =1 - 
When the predictive model y is chosen as the bilinear form 

y = X T Wz with W e T(r, d x , d 2 ) , 

then the problem boils down to the optimization problem (JTJ) 

n 

min We jr (r . )dlid2 ) ^2{£(yi,yi)} (3) 

i=l 

where the loss function £(y, y) penalizes the discrepancy between an observation y and the 
value that is predicted by the model y. An application of this setup is the inference of edges in 
bipartite or directed graphs. Such a problem arises in bioinformatics for the identification of 
interactions between drugs and target proteins, micro- RNA and genes or genes and diseases 
|YAG + 08~1 IBY09] . Another application is concerned with image domain adaptation [KSD11J 
where a transformation x T Wz is learned between labeled images x from a source domain X 
and labeled images z from a target domain Z. The transformation W is used on any new 
input data to map the data from one domain to the other. A potential interest of the rank 
constraint in these applications is to address problems with a high-dimensional feature space 
and perform dimensionality reduction on the two data domains. 

2.3 Multivariate linear regression 

Given matrices Y € W ixk (output space) and X £ M. nxq (input space), we seek to learn a 
weight /coefficient matrix W € M. qxk that minimizes the discrepancy between Y and XW 
[YELM07]. Here n is the number of observations, q is the number of predictors and k is the 
number of responses. One popular approach to multivariate linear regression problem is by 
minimizing a quadratic loss function. Note that in various applications responses are related 



and may therefore, be represented with much fewer coefficients. From an optimization point 
to view this corresponds to finding a low-rank coefficient matrix. The papers [YELM07, 
IAFSU07] , thus, motivate the rank-constrained optimization problem formulation. In the 
present framework it is defined as, 

min We j- (r . || Y - XW||^. 

Though the quadratic loss function is shown here, the optimization setup extends to other 
smooth loss functions as well. 

2.4 Numerical comparisons: Matrix completion as a benchmark 

To illustrate the notions presented in this paper, we consider the problem of low-rank matrix 
completion as the benchmark application. The objective function is a smooth least square 
function and the search space is the space of fixed-rank matrices. It is an optimization problem 
that has attracted a lot of attention in recent years. 

All simulations are performed in MATLAB on a 2.53 GHz Intel Core i5 machine with 4 
GB of RAM. We use MATLAB codes of the benchmark algorithms for our numerical studies. 
For each example, a d\ x d<i random matrix of rank r is generated according to a Gaussian 
distribution with zero mean and unit standard deviation and a fraction of the entries are 
randomly removed with uniform probability. The dimensions of d\ x di matrices of rank r is 
[d\ + d2 — r)r. The over-sampling (OS) ratio determines the number of entries that are known. 
A OS = 6 means that 6(d± -+- d% — r)r number of randomly and uniformly selected entries 
are known a priori out of d\d2 entries. Matrix reconstruction results have been proved in the 
context of fixed-rank optimization [KMO10] with an appropriate initialization. All algorithms 
(including competing methods) are initialized as proposed in [KMO10J. Numerical codes for 
the proposed algorithms are available from the first author's homepage. 

We consider the problem of completing a 32000 x 32000 matrix W* of rank 5 as the running 
example in many comparisons. The over-sampling ratio OS is 8 implying that 0.25% (2559800) 
of entries are randomly and uniformly revealed. The maximum number of iterations is fixed 
at 500 and the tolerance for the objective function is set at m\\Vn(W) -Vq(W*)\\ 2 f < 10~ 20 
(10 -25 for trust -region algorithms). A high tolerance is needed to observe the asymptotic rate 
of convergence of algorithms. For numerical comparisons with various competing algorithms 
we use the MATLAB codes supplied on authors' homepages. 

3 From matrix factorization to Riemannian quotient geometry 

We review three matrix factorizations for fixed-rank non-symmetric matrices and study the 
geometry of the resulting search space. Factorization models lead to product spaces of well 
known matrix manifolds. The quotient nature of the search space stems from the fact that a 
given element W 6 J-(r,d\,d2) is represented by an entire equivalence class of matrices due 
to intrinsic invariance properties of the factorization (Figured]). Understanding invariance is 
critical in designing algorithms [AMS08, AILH09]. An important observation is that the local 



minima are not isolated in the product space of factorization models. We seek to optimize 
the cost function in an abstract search space where the local minima are isolated. Such a 
space arises naturally in the factorization schemes as shown below and with a proper metric 
has the structure of a Riemannian manifold [AMS08]. 

The fixed-rank matrix factorizations of interest are rooted in the thin singular value de- 
composition of a r-rank matrix W = UX!V T , where U is a di-by-r matrix with orthogonal 
columns, that is, an element of the Stiefel manifold St(r,di) = {U G R dlXr : U T U = I}, 
S G Diag + (r) is a r-by-r diagonal matrix with positive entries and V 6 St(r, c^). The 
singular value decomposition (SVD) exists for any matrix W G J : (r,di,d2) [GVL96]. 




T(r,d u d 2 ) ~ Rf lXr x R** xr /GL{r) ~ St(r,d a ) x S ++ (r) x St(r, d 2 )/OG(r) ~ St(r,di) x Rf- xr /OG(r) 

Figure 1: The considered fixed-rank factorizations admit a quotient manifold geometry due 
to intrinsic invariance properties of factorization models. 

3.1 Full-rank factorization (beyond Cholesky-type decomposition) 

The first factorization is obtained when the singular value decomposition (SVD) is rearranged 
as 

W = (US5)(S5V T ) = GH T , 

where G = US^ £ M^ lXr , H = VSi G Mf 2Xr and lt^ xr is the set of full column rank 
dxr matrices, also known as full-rank matrix factorization. The resulting factorization is not 
unique because the transformation 

(G,H)h(GM^,HM t ), (4) 

where M G GL(r) = {M G IR rxr : det(M) ^ 0} leaves the original matrix W unchanged 
[P099|. The classical remedy to remove this indeterminacy is Cholesky factorization, which 
requires further (triangular- like) structure in the factors. LU decomposition is a way forward 
|GVL96| . In contrast, we encode the invariance map in an abstract search space by 
optimizing over a set of equivalence classes defined as 

[W] = [(G, H)] = {(GrVT 1 , HM T ) : M G GL(r)}. (5) 



The set of equivalence classes is termed as the quotient space of W by GL(r) and is denoted 

as 

W:=W/GL(r), (6) 

where the total space W is the product space M^ lXr x M^ 2Xr . 

Among the set of equivalence [( G, H)], a numerically well-conditioned representative is 
obtained by choosing (G,H) such that ||G||f = ||H||_p. This balancing update condition 
encodes the fact that both factors G and H have a comparable weight in the factorization. 
The usefulness of using the quotient geometry with the numerically-conditioned factors is 
demonstrated in Section \5. 21 

3.2 Polar factorization (beyond SVD) 

The second quotient structure for the set J-(r, di, cfo) is obtained by considering the following 
group action on the SVD [BS09J, 

(U,S,V) i — y (UO,O r SO,VO), 

where O is any r x r orthogonal matrix, that is, any element of the orthogonal group 

0(r) = {O € R rxr : O t O = OO t = I}. 
This results in polar factorization 

W = UBV T , 

where B is now a r x r symmetric positive definite matrix, that is, an element of 

S ++ (r) = {BeK rxr :B T = B^0}. (7) 

The polar factorization is close to the original purpose of singular value decomposition as 
identifying the geometries invariants of linear transformations |GVL96j . The choice of B 
being positive definite rather than 5] diagonal gives more flexibility in the optimization and 
removes the discrete symmetries induced by interchanging the order on the singular values. 
Empirical evidence to support the choice of S++(r) over Diag + (r) (set of diagonal matrices 
with positive entries) for the middle factor B is presented in Section 15.31 

The product space is St(r, d\) x S ++ {r) x St(r, g^)- The set F(r, d\,d^] is the quotient 
W/0(r), that is, the set of equivalence classes defined as 

[W] = [(U, B, V)] = {(UO, O t BO, VO) : O € O(r)}. (8) 

A relevant property of the factorization is that regularizing the norm of W is very cheap 
because it reduces to regularizing the norm of B which is of size r <C min(<ii, d%). For 
example, adding a regularization term proportional to the Frobenius norm ||W||^ to the cost 
function is equivalent to adding a regularization term proportional to ||B||^. Likewise, a 
regularization on the nuclear norm || W||* = Yl o~i where o~iS are the singular values of W, is 
equivalent to a regularization on the trace of B i.e., ||UBV T |[* = Tr(B) [MMBSllJ. 



3.3 Subspace-projection factorization (beyond QR decomposition) 

The third low-rank factorization is obtained from the SVD when two factors are grouped 
together, 

W = U(SV T ) = UZ T , 

where U £ St(r, d\) and Z G R^ 2Xr . The column subspace of W matrix is represented by U 
while Z is the (left) projection or coefficient matrix of W. The factorization is not unique 
as it is invariant with respect to the group action (U, Z) i-> (UO, ZO), whenever O € 0(r). 
The classical remedy to remove this indeterminacy is the QR factorization for which Z is 
chosen upper triangular |GVL96| . We work with the set of equivalence classes 

[W] = [(U, Z)] = {(UO, ZO) : O G 0(r)} (9) 

in the total space W := St(r, di) x M rf2Xr , viewing the search space as the quotient space 

W = W/0(r). (10) 

Recent contributions using this factorization include [BAllj, [SElOj. See the discussion in 
Section [5. II for a comparative review. 

4 Manifold-based optimization 

Classical optimization algorithms such as penalty methods, barrier methods or augmented 
Lagrangian methods [NW06] generally deal with structured matrix search spaces by means of 
explicit constraints or penalty terms expressed as a function of the decision variable. These 
methods turn a constrained optimization problem into a sequence of unconstrained opti- 
mization problems for which classical unconstrained optimization techniques are applied. An 
alternative is to embed the constraints into the search space and to solve an unconstrained op- 
timization problem on the constrained search space. This approach is taken by optimization 
algorithms on matrix manifolds [AMS08J. 

In a nutshell, a manifold W is a set endowed with a compatible differentiable structure. 
Once a manifold is endowed with a compatible differentiable structure, calculations can be 
performed and the classical geometric objects of optimization such as derivatives, gradients 
or Hessians all admit a generalization to manifolds. 

Important class of manifolds include embedded submanifolds and quotient manifolds. 
Embedded submanifolds can be viewed as a generalization of the notion of surface in the 
Euclidean space. They are defined by means of an explicit set of algebraic constraints in 
matrix space. This general concept applies straight to the Stiefel manifold St(r, d) that is 
regarded as a submanifold embedded in the Euclidean space M dxr . Another example is the 
d— 1-dimensional unit sphere S d ~ l embedded in which coincides with St(l,d). Likewise, 
the set of d-by-d orthogonal matrices 0(d) can be regarded as an embedded submanifold 
of M. dxd and coincides with St(d,d). The set, J-(r,d\,d2), of r — rank R rflXd2 matrices also 
forms a smooth manifold embedded in the Euclidean space R dlXrf2 and therefore, can also be 



F(r, di,d 2 ) 



Figure 2: A line-search on a Riemannian embedded submanifold J-(r, di,da) in the Euclidean 
space M. dlXd2 . The search direction £w t belongs to the tangent space Twt-7 7 - The retraction 
mapping R pulls back the new iterate onto the manifold .F(r, ^1,^2). 

viewed as an embedded submanifold [SW C1CH IVanllj . See Figure [2j Details in Section [5\4"1 
This geometry is different from the quotient geometry proposed in Section [3l 

The present paper focuses on the quotient manifold geometries induced by the factoriza- 
tion models of Section [3J Each point of a (matrix) quotient manifold represents an entire 
equivalence class of matrices. Abstract geometric objects on the quotient manifold can be 
defined by means of matrix representatives, provided that their definitions do not depend on 
a particular choice for the representative within an equivalence class. 

Let W be the total space that is equipped with an equivalence relation ~. For example, 
with the subspace-projection factorization model, each point x E W has the matrix represen- 
tation (U, Z) such that W = UZ T . The equivalence class (or fiber) of a given point x € W 
is defined by the set 

= {y € W : y ~ x} 
where x 6 W is the equivalence class [x]. By extension, the set 

W := W/ ~ = {[W] : W € W}, 

is the quotient manifold of W by ~. The mapping it : W —> W is called the quotient map 
or canonical projection. Clearly, we have ir(x) = ir(y) if and only if x ~ y. The set W is the 
total space of the quotient manifold W. 

A popular example of quotient manifold is the Grassmann manifold Gr(r, d), that is, the set 
of r-dimensional subspaces in R a!xr '. Gr(r, d) ~ M^ xr fO(r) where M. dxr is the set of full column 
rank d x r matrices and 0(r) is the set of r x r orthogonal matrices. Indeed, each subspace 
can be defined by a r-dimensional orthogonal frame up to a rotation matrix O € 0(r). The 
reader is referred to the papers of [EAS98] or [AMS04] for alternative characterizations of the 
Grassmann manifold Gr(r, d) as a quotient manifold. 

Notions on quotient manifold 

For quotient manifolds W = W/ ~ a tangent vector £ x G T X W at x = [x] is restricted to 
the directions that do not induce a displacement along the set of equivalence classes. This 



is achieved by decomposing the tangent space in the total space T X W into complementary 
spaces 

T w W = V w Wffi^wW. 
Refer Figure [3] for a graphical illustration. The vertical space V X W is the set of directions 




• X = [x] 

Figure 3: The quotient space. The points y and x in W belonging to the same equivalence 
class are represented by a single point [x] in the quotient space W. 

that contains tangent vectors to the equivalence classes. The horizontal space H X W is the 
complement of V X W in T X W. The horizontal space provides a representation of the abstract 
tangent vectors to the quotient space, i.e., T X W. Indeed, displacements in the vertical space 
leave the matrix W (matrix representation of the point x) unchanged, which suggests to 
restrict both tangent vectors and metric to the horizontal space. Once T X W is endowed with 
a horizontal distribution H X W, a given tangent vector £ x G T X W at x in the quotient manifold 
W is uniquely represented by a tangent vector £ 5 G T-L X W in the total space W. 

The tangent vector £ s 6 H X W is called the horizontal lift of £ x at x. Provided that the 
metric defined in the total space is invariant along the set of equivalence classes. A metric 
9x(£,x,Cx) in the total space defines a metric g x on the quotient manifold. Namely, 

9x(£x, Cx) ■■= g~x{£x, GO (if) 
where £ x and C, x are the tangent vectors in T X W and £ x and are their horizontal lifts in 

n x w. 

Natural displacements at x in a direction £ 5 € 1-L X W on the manifold are performed by 
following geodesies (paths of shortest length on the manifold) starting from x and tangent to 
This is performed by means of the exponential map 

x t+ i = Exp St (s t ^ t ), 

which induces a line-search algorithm along geodesies with the step size sj. However, the 
geodesies are generally expensive to compute and, in many cases, are not available in closed- 
form. A more general update formula is obtained if we relax the constraint of moving along 



W = W/ ~ x t 



Figure 4: A line-search on a Riemannian quotient manifold W in the total space W. Con- 
ceptually, we move on the quotient manifold from xt to xt+i but computationally in the total 
space W. The search direction is and belongs to the horizontal space "H^W at point 
xt- The retraction mapping maintains feasibility of the iterates. The blue line denotes the 
equivalence class [xt]. 



geodesies. The retraction mapping Rx t ( s t^x t ) a ^ %t locally approximates the exponential map- 
ping [AMS08]. It provides a numerically attractive alternative to the exponential mapping 
in the design of optimization algorithms on manifolds, as it reduces the computational com- 
plexity of the update while retaining the essential properties that ensure convergence results. 
A generic abstract line-search algorithm is, thus, based on the update formula 

xt+i = Rx t (stCx t ) (12) 

where St is the step-size. A good step-size is computed using the Armijo rule [NW06]. We 
use the information of the previous step-size to make the initial guess by using the following 
simple adaptive step-size update procedure. 

Given : §t (initial step — size guess) 

jt (number of line — search), and 
s t (optimal step — size) at iterate t 
Then : initial step — size guess at t + 1 iterate 
' 2s t , j t = 
2s t , j t >2. 



(13) 



St+l 



It ensures that on an average we keep the number of line-searches close to 1. 

Similarly, second-order algorithms on the quotient manifold W are horizontally lifted and 
solved in the total space W. Additionally, we need to be able to compute the directional 
derivative of gradient along a search direction. This relationship is captured by an affine 
connection V on the manifold. The vector field implies the covariant derivative of vector 
field rj with respect to the vector field £. In the case of W being a Euclidean space, the affine 



connection is standard 

v 4 Ux t->o t 

However, for an arbitrary manifold there exists infinitely many different affine connections 
except for a specific connection called the Levi-Civita or Riemannian connection which is 
always unique. The properties of affine connections and Riemannian connection are in section 
5.2 and Theorem 5.3.1 of AMS08J. The Riemannian connection on the quotient manifold W 
is given in terms of the connection in the total space W once the quotient manifold has the 
structure of a Riemannian submersion [AMS08J. 

Analogous to trust-region algorithms in the Euclidean space, trust-region algorithms on a 
quotient manifold with guaranteed quadratic rate convergence have been proposed in [ABG07, 
AMS08J . The trust-region subproblem on W is formulated as 

m.m^ TxW 4>(x) + g x (£,grad(f)(x)) + \g x (£, Hess0(x)[£]) 
subject to g x {£,€) < 5. 

where 5 is the trust-region radius and grad^ and Hessc/> are the Riemannian gradient and 
Hessian on W. The problem is horizontally lifted to the horizontal space % x yV [ABG07, 
BAG07J. Solving the above trust-region subproblem leads to a direction £ that minimizes 
the quadratic model. The trust-region subproblem is solved efficiently by the generic solver 
GenRTR [BAG07]. The potential iterate on the manifold W is obtained using the retraction 
operator R. Depending on whether the decrease of the cost function sufficient or not, the 
potential iterate is accepted or rejected. 



Numerical complexity 

The numerical complexity per iteration of the proposed gradient descent and trust-region 
algorithms depends on the computational cost of various components described before. All 
manifold related operations are of linear complexity in d\ and di. Other operations depend 
on the problem at hand and are computed in the search space W. With r <C minjdi, 0I2} the 
computational burden on algorithms is considerably low [MMBS11, Mcyllj. A tentative list 
of computational cost of geometry related operations is given in Section 15.11 



5 Riemannian geometries of rank-constrained matrices 

In this section, we present the geometric concepts that allow for a systematic derivation of line- 
search and trust-region algorithms on each of the factorization models described in Section 
El The tools are generic in nature and do not depend on the cost function at hand. The 
illustrations, however, are shown on the low-rank matrix completion problem (see Section 
12. ip . The main concepts and notations have been introduced in Section HI This section 
discusses the derivation of these objects. The developments are shown in detail for the 
subspace-projection factorization (Section I3.3() . For the other two factorization models, full- 
rank factorization and polar factorization, the developments are similar and we present only 
the final expressions. Details can be found in [MMBSilj |MeylT] . 



5.1 Geometry of algorithms using a subspace-projection factorization W = 



We a seek a local minimum to the following optimization problem 

minu.z 4>(U, Z) 



— (14) 
subject to (U, Z) G W 



where W is the total space St(r, d\) x R 2Xr and : W — )• M is a smooth function. As 
mentioned in Section [331 the local minima of the above optimization problem are not isolated 
in the space W and hence, we seek to solve the problem in the quotient space W = W/0(r). 
The problem (|14p is thus conceptually an unconstrained optimization problem on the quotient 
manifold W in which the minima are isolated. Computations are performed in the total space 
W, which is the product space of well-studied manifolds. 

Tangent space of W 

Tangent vectors at a point x G W have a matrix representation in the tangent space of the 
total space W. Because the total space is a product space St(r, d\) x M!j 2><r , its tangent space 
admits the decomposition at a point x = (U, Z) 

T S W = T u St(r,d 1 ) x T z < 2Xr 

and the following characterizations are well-known [EAS98, AMSC^. Note that R d2Xr is an 
open subset of 



TuStMi) = {Uft + V ± K\neS skew (r),KeR^- r ^ r } 

= {Z u -USym(U T Z u )|Z u eR nx P} 
T z Kf 2Xr = R d2Xr 

where S s k e w( r ) is the set of r x r skew-symmetric matrices, the operator Sym(A) = a+ 2 aT and 
Uj_ is the orthogonal complement of the space spanned by U. Note that an arbitrary matrix 
(Zu,Z z ) G R dlXr x R d2Xr is projected on the tangent space T S W by the linear operation 

* s (Zu, Z Z ) = (Zu - USym(U T Zu), Z z ). (15) 

A matrix representation of the tangent space at the equivalence class x = [x] £ W relies on 
the decomposition of T 5 W into its vertical and horizontal subspaces. The vertical space V^W 
is the subspace of T S W that is tangent to the equivalence class [x] 

v s w = {(un, zn)|n g 5 sfceu ,(r)}. (ie) 

The horizontal space ^^VV is chosen such that 

T S W = UxW V S W. (17) 
We choose 'H^W as the orthogonal complement of V^W for the metric 

9x(&,m) = Tr(^u) + Tr((Z T Z)- 1 eIj? B ), (18) 



which picks the normal metric of the Stiefel manifold [EAS98J and the right-invariant metric 
of R^ 2Xr [AMS04J. £ s and fj% are elements of T X W. The choice of this metric is motivated 
later in this section. With this choice, a horizontal tangent vector Q x is any tangent vector 
(CujCz) belonging to the set 

HzW = {(Cu,Cz)eT s W|^((Cu,Cz,) > (un,zn)) = o}. (19) 

Starting from an arbitrary tangent vector f\ x G T X W we construct its projection on the 
horizontal space by picking fl G S s k e w(p) such that 

n 5 (%)= (j)u-un,^-zn) enjv, (20) 

Using the calculation (fl~9j) , the unique fl that satisfies (f20|) is the solution of the following set 
of Lyapunov equations. We introduce an auxiliary variable f2d ummy € S s k ew ( r ) that simplifies 
the exposition for finding ft. The expression after introducing ridummy is 

(Z T Z)ft dummy + fi dummy (Z T Z) = 2Skew((Z T Z)(U T 7?u)(Z T Z)) - 2Skew((?gZ)(Z T Z)) 

(z T z)n + n(z T z) = n dummy 

(21) 

where the operator Skew(A) = A ~ 2 A . The numerical complexity of solving a Lyapunov 
equation is 0(r 3 ). 

The Riemannian submersion (W, g) 

The choice of the metric (|18p . which is invariant along the equivalence class [x], and of the 
horizontal space (|19p turn the quotient manifold W into a Riemannian submersion of (W, g) 
[AMS08]. As shown in [AMS08], this special construction allows for a convenient matrix 
representation of the gradient and the Hessian on the abstract manifold W. The Riemannian 
gradient of <p : W — > R : x \— > (j){x) = (f>(x) is uniquely represented by its horizontal lift in W 
which has the matrix representation 

grad^ = grad^. (22) 

It should be emphasized that grad s </> is in the the tangent space T S W. However, due to 
invariance of the cost along the equivalence class [x], grad 5 c/> also belongs to the horizontal 
space 7i x W and hence, the equality in (|22p [AMS08]. On the other hand, the matrix expression 
of grad 5 </> at a point x = (U, Z) is standard: the Euclidean gradient (Gradu^, Gradz^) must 
simply be projected on T X W, i.e., 

gradj.^ = ^(Gradu^, Gradz^) 

Likewise, the Riemannian connection V u n on W is uniquely represented by its horizontal lift 
in W which is 

V^ = U x {V p fj)- 



where v and n are vector fields in W and v and f/ are their horizontal lifts in W. In this case 
as well, the Riemannian connection Vpf/ on W has well-known expression |Jou091 Meyll 
IAMS08 ] , obtained by means of the Koszul formula, given by 



V 9 fj = 9 s (Dfj[P]) - * 5 



( ^uSym(U T 7?u), \ 
- r? z (Z T Z)- 1 Sym(Z r P z ) -^z(Z T Z)- 1 Sym(Z r r?z) 

V +Z(Z r Z)- 1 Sym(^P z ) / 

(23) 

Here Dr^[p] is the classical Euclidean directional derivative of fj in the direction D. The 
horizontal lift of the Riemannian Hessian in W has, thus, the following matrix expression 



Hessc^x) [£] = ( V ? -grad0) . (24) 
for any £ € T X W and its horizontal lift £ € 'H^W. 

Trust-region subproblem and retraction 

The trust-region subproblem on the abstract space W is horizontally lifted to 7-L x W and 
formulated as 



subject to 5- 



where 5 is the trust-region radius and grad^ and Hess^ are the horizontal lifts of the Rie- 
mannian gradient and Hessian on W. The trust-region subproblem is solved efficiently by the 
generic solver GenRTR [BAG07] . Refer [ABG071 |BAT?07] for the algorithmic details. The 
output is a direction £ that minimizes the model. 

To find the new iterate based on the obtained direction £, a mapping from the tangent 
space H x to the manifold W is required. This mapping is more generally referred to as 
retraction R x : 7i x W — > W (details in [AMS08J). In the present case, a retraction of interest 
is [AMS081 |Meyi1"1 

i*u(£u) = uf(u+eu) m) 

Rz{h) = Z + h 

where uf is a function that extracts the orthogonal factor [AMS08J i.e., 

uf(A) = A(A T A)~i 



Numerical complexity 

The numerical complexity of manifold-based optimization depends on the computational cost 
of the following components. 

• Objective function (j> — > problem dependent 

• Metric g — ► O{olir 2 + d 2 r 2 + 2r 3 ) 

• Euclidean gradient of cj> — > problem dependent 



— D?y[f] — > problem dependent 

— Matrix multiplication terms — > 0{d\r 2 + 3d2r 2 + 5r 3 ) 

• Projection operator ^ — > 0(d±r 2 ) 

• Projection operator II — > 0(d±r 2 + a^r 2 ) 

— Solving Lyapunov equation — > 0(r 3 ) 

• Retraction R — ► 0(dir 2 + 2r 3 ) 

As shown above all the manifold related operations are of linear complexity in d\ and di- 
Other operations depend on the problem at hand and are computed in the product space W. 

Discussion 

The choice of ([15]) is motivated by the fact that the total space St(r, d\) x M^ 2Xr equipped 
with this metric is a complete Riemannian space. An alternative to (|18p would be to consider 
the standard Euclidean metric 

9x&,fk) = TV(^ttu) + Tr(^z) (27) 

which is also invariant by rotations 0(r). £^ and % are elements of T^W. This metric is for 
instance adopted in [SE10] . Although this alternative choice is appealing for its numerical 
simplicity, Figure [5] clearly illustrates the benefits of optimizing in a complete metric space. 
Under identical initializations and choice of step-size rule [NW06], the invariant metric pre- 
vents the numerical poor conditioning that arises in the presence of unbalanced factors U and 
Z i.e., ||U||f ¥> \\^\\f- 
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Figure 5: The choice of an invariant metric for subspace-projection factorization algorithm 
dramatically affects the algorithm performance. The example shown is to complete a rank 5 
completion of a 4000 x 4000 matrix with 98% (OS = 8) entries missing but the observation 
is generic. 



This factorization is also exploited in the recent paper |BA11| for the low-rank matrix 
completion problem (the application is described in Section 12. ip . The RTRMC algorithm 
of [BA11] is an alternating minimization scheme where the authors exploit the fact that in 
the variable Z, minz 0(U, Z) is a least square problem that has a closed-form solution. 
They are, thus, left with an optimization problem in the other variable U on the Grassmann 
manifold Gr(r, d\). 

The resulting geometry of RTRMC is efficient in situations where d\ <C ofo for the low-rank 
matrix completion. The advantage is reduced in square problems and the numerical experi- 
ments in Figure [6] suggest that our generic algorithm compares favorably to the Grassmanian 
algorithm in [B Allj in that case. 
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Figure 6: Rank 5 completion of 32000 x 32000 matrix with OS = 8. The framework proposed 
in this paper is competitive with RTRMC. Top: Comparison of gradient descent schemes. 
The quotient geometry uses the Armijo rule with adaptive step-size update for computing the 
step-size. RTRMC first-order uses the RTRMC code with Hessaian replaced by identity which 
yields a steepest descent algorithm [BAllj . Below: Comparison of trust-region schemes. Both 
algorithms use the solver GenRTR [BAG07] to solve the trust -regions subproblem. 



5.2 Geometry of algorithms using full-rank factorization W = GH T 

The calculations in this part are similar to that of Section I5.ll Details are provided in 
|Meyll| . We provide here the final expressions for the ingredients that are needed for a a 



generic optimization scheme. 

The total space W is R^ lXr x R^ 2Xr and due to the product structure the tangent space 
T X W at x = (G, H) has the expression 

T X W = R dlXr x R d2Xr . 

Note that JR^ lXr is an open subset of ]R rflXr and thus, for an arbitrary matrix (Z g ,Zh) G 
H dlXr x R d2Xr , the projection operation \P S onto the tangent space T X W is the identity map 

¥ a (Z G ,Z H ) = (Z G ,Z H ). (28) 

We endow the tangent space with the metric that is invariant to the equivalence classes 
generated by the invariance W/GL(r). We equip the subspaces, both left and right, with the 
right-invariant metric 

H&Vx) = TV((G T G)- 1 ^ G ) + Tr((H r H)- 1 Clif7H) (29) 

where £ x ,fj x G T X W. The tangent space is decomposed into the vertical and horizontal space 
with expressions 

V S W = { (-GA, HA T ) | A G W' xr } and 

n s W = {(Cg,Ch) |CgGH T H = G t GH t Ch, Cg G M dlXr ,Cn G R d ^ r } 

where Q x G T X W and for any A G M rxr . An element % £ T X W is projected onto the horizontal 
space using the projection operator 

Kx(m)= (ffu + GA, f/H — HA T ) (31) 
where A G l rxr is uniquely obtained by solving the following Lyapunov equation 



(t7g + GA) T GH T H = G T GH r (fjn - HA T ) 

A T (G T GH T H) + (G T GH T H) A T = G T GH T r? H - 7^GH T H. 



(32) 



Note that (|32p is a Lyapunov equation of dimension r with the numerical cost 0(r 3 ). The 
expression of the Riemannian connection Vpf] in W is 

Vpfj = Dfj[u] + (P, Q) and 

P = - 77 G (G r G)- 1 Sym(G r ^G) - PG(G T G)- 1 Sym(G^ G ) 

+ G(G T G)- 1 Sym(^P G ), (33) 

Q = -%(H T H) -1 Sym(H T PH) -PH(H T H) -1 Sym(H T 7?H) 
+ H(H r H)- 1 Sym(^ H ). 

Here fj and P are any vector fields on W and Df][p] is the classical Euclidean directional 
derivative of f) in the direction v. Finally, we choose a retraction 



#g(£g) = G + £ g 
#h(£h) = H + £ H 



(34) 



As mentioned earlier in Section 13.11 we perform a balancing update on the retracted point 
to ensure better numerical conditioning. To balance the points (G,H) we use the following 
update rule 



Gbalanced 
-H-balanced 



io, 

a 

aH, 



where a S M+ is given by 



a: 



IHI 



(35) 



(36) 



Observe that GbaiancedH^ alanced = GH T . A different balancing rule is proposed in [M BSlla] . 

The resulting algorithm proceeds in two cascaded updates, see Figure [71 The first update 
moves the iterate from one equivalence class to another while minimizing the cost function of 
interest. The second "balancing" update is a change of representative along the given equiv- 
alence class. This scheme with cascaded retraction ensures that we asymptotically converge 
to a local minimum of the cost function with a balanced factorization. 




W = W/ 



Figure 7: The proposed algorithm based on the factorization W = GH T proceeds in two 
cascaded steps: a retraction step for cost minimization and a balancing step that ensures 
good numerical conditioning. The combination of these two steps correspond to a single 
iteration on the quotient manifold. 



Discussion 

The proposed scheme (gradient descent algorithm) is closely related to the gradient descent 
version of the Maximum Margin Matrix Factorization (MMMF) algorithm [RS05]. The gra- 
dient descent version of MMMF is a descent step in the product space M* 1 xr x M^ 2 xr equipped 
with the Euclidean metric, 

H&,m) = T*(I£»7g) + Tr(^ H ) (37) 

where £ x ,fjx £ T X W. Note the difference with respect to (I29p . As a result, the invariance (with 
respect to r x r non-singular matrices) is not taken into account. In contrast, the proposed 
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Figure 8: The proposed algorithm resolves the issue of choosing an appropriate step-size when 
there is a discrepancy between ||G||.p and ||H||jr, a situation that leads to a slow convergence 
of the MMMF algorithm. 



updates (fM|) and are invariant along the set of equivalence classes (0). This resolves the 
issue of choosing an appropriate step size when there is a discrepancy between ||G||_p and 
||H||ir. Indeed, this situation leads to a slower convergence of the MMMF algorithm, whereas 
the proposed algorithm is not affected (Figure [8]). To illustrate this effect, we consider a rank 
5 matrix of size 4000 x 4000. An incomplete matrix is generated under standard assumptions. 
2% of entries (OS = 8) are revealed uniformly at random. The step-size is computed using 
the Armijo rule [NW06J. Both algorithms are initialized similarly and the initial discrepancy 
between the factors at initialization is kept at ||Ho||_f ~ 10 4 ||Go||f- 

The LMaFit algorithm [WYZ10J for the low-rank matrix completion problem relies on 
the factorization W = GH T to alternatively learn the matrices W, G and H so that the 
error ||W — GH T ||^ is minimized while ensuring that the entries of W agree with the known 
entries i.e., Vq(W) = Vn(W*). The algorithm is a tuned version of the block-coordinate 
descent algorithm in the product space that has a superior computational cost per iteration 
and better convergence than the straight forward non-linear coordinate scheme. 

We compare both our gradient descent and trust-region algorithms with LMaFit in Figure 
Though the numerical cost per iteration for the trust-region algorithm is higher than the 
gradient descent implementation, the trust-region algorithm is the algorithm of choice when 
a higher accuracy is required. 

5.3 Geometry of algorithms using polar factorization W = UBV T 

This section contains the final expressions of ingredients that are needed to perform both first- 
order and second-order optimization. Details are provided in the earlier references, jMMBSlTl 
MeyllH EEHI- 



The total space W is St(r, d±) x S ++ (r) x St(r, d 2 ) and due to the product structure the 
tangent space T S W at x = (U, B, V) is 

T S W = TuSt(r,di) x T B S ++ (r) x T v St(r,d 2 ) 




Figure 9: Rank 5 completion of 32000 x 32000 matrix with OS = 8. LMaFit has a superior 
computational complexity per iteration but the convergence seems to suffer for very large-scale 
matrices with low-rank. 

and the following characterizations are well-known [EAS98J ISmi05| 

TuSt(r,di) = {Uft + U ± K|ft € S skew (r),K G R^-*)**} 

= {Z u -USym(U T Z u )|Z u £l dlXr ) 
TbS++(t) = S syrn (r) 

where S sym (r) is the set of r x r symmetric matrices, S s kew( r ) is the set of r x r skew-symmetric 
matrices and Uj_ is the orthogonal complement of the space spanned by U. Note that an 
arbitrary matrix (Zu,Zb,Zv) € M dlXr x W xr x R d2Xr is projected on the tangent space 
TxW by the linear operation 

* s (Zu, Z B , Z v ) = (Zu - USym(U r Zu), Sym(Z B ), Z v - VSym(V T Z v )). (38) 
The vertical space V^W is the subspace of T S W that is tangent to the equivalence class [x] 

v s w = {(un,Bn - cib, vn)|n e s skew (r)}. (39) 

We choose 'H^W as the orthogonal complement of V^W for the metric 

9xfa,m) = Tr(|S^u) + ^(B-^bB- 1 ^) m 
+Tr(^r/v) 

where ^ and fjx are elements of T^Yv '■ Starting from an arbitrary tangent vector % 6 T 2 .W 
we construct its projection on the horizontal space by picking fi G 5' s fc eu ,(r) such that 

IIxfe)= (f7u-UO ! ^B-(Bn-nB),^ v -Vfi) £%W, (41) 

The unique ft that satisfies (I4ip is the solution of the Lyapunov equation 

fiB 2 + B 2 ft = B(Skew(U T f/u) - 2Skew(B- 1 f/ B ) + Skew(V T f/ v ))B. (42) 

The computational cost of solving the Lyapunov equation ()42|) is 0(r 3 ). Similar to the cases 
of other factorization models, the product structure of W allows to write the Riemannian 
connection Vpf/ in the total space using the expression }Jou09l ISmi051 AMS08J 

V v f) = * 5 (Dt7[P]) - # s ( i/uSymtU^u), Sym( l / B B- 1 r ? B), ^Sym(V T 7 ?v ) ) (43) 



where D^f^] is the classical Euclidean directional derivative of r\ in the direction v. Finally, 
a retraction of interest is [AMS081 IMBSllaj 

#u(£u)=uf(U + £u) 

#b(£b) = B3exp(B-36,B-3)B3 (44) 
R-vitv) = uf(V + £ v ) 

where uf is a function that extracts the orthogonal factor (refer Section 15, ip and exp is the 
matrix exponential operator. The retraction on the positive definite cone S++(r) is the natural 
exponential mapping for the metric (lllj) [Smi05] . 

Discussion 

We illustrate here the empirical evidence that constraining B to be diagonal (as is the case 
with singular value decomposition) is detrimental to optimization. We consider the simplest 
implementation of a gradient descent algorithm for matrix completion problem (see below). 
The plots shown in Figure [10] compare the behavior of the same algorithm in the search space 
St(r, d±) x S ++ (r) x St(r, dq) (polar factorization) and St(r, di) x Diag + (r) x St(r, cfe) (SVD). 
Diag + (r) is the set of diagonal matrices with positive entries. The metric and retraction 
updates are same for both the algorithms. The only difference lies in constraining B to be 
diagonal which means that the Riemannian gradient for the later case is in the space Diag(r) 
(the tangent space of Diag + (r)). 

The empirical observation that convergence suffers from imposing diagonalization on B is 
a generic observation that does not depend on the particular problem at hand. The problem 
here involves completing a 4000 x 4000 of rank 5 from 2% of observed entries. 
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Figure 10: Convergence of a gradient descent algorithm is affected by making B diagonal. 

The OptSpace algorithm |KMO10] also relies on the factorization W = UBV T , but with 
B G M rxr . At each iteration, the algorithm minimizes the cost function (j) 

min UiV <KU,B,V) 



— Symmetric positive definite 

— Diagonal 




over the manifold St(r, di)/0(r) x St(r, d 2 )/C( r ) obtained by fixing B and then solving the 
inner optimization problem 

min B 0(U,B,V). (45) 

for fixed U and V. The algorithm thus alternates between a gradient descent step on the 
subspaces U and V for fixed B, and a least-square estimation of B (matrix completion 
problem) for fixed U and V. The proposed framework is different from OptSpace in the 
choice B positive definite versus B £ lR rxr . As a consequence, each step of the algorithm 
retains the geometry of polar factorization. Our algorithm also differs from OptSpace in the 
simultaneous and progressive nature of the updates. Furthermore, the choice B >- allows 
us to derive alternative updates based on different metrics on the set S ++ (r). This flexibility 
is exploited in a previous paper [MBS lib] to show that metrics of S++(r) are connected to 
Bregman divergences and information geometry. A potential limitation of OptSpace comes 
from the fact that the inner optimization problem (|45p may not be always solvable efficiently 
for other applications. 

The singular value projection (SVP) algorithm [JMD10J is based on the SVD factorization 
W = UBV T with B 6 Diag + (r), diagonal matrix with positive entries. It can also be 
interpreted in the considered framework as a gradient descent algorithm in the embedding 
space M. dlXd2 , along with an efficient SVD-based retraction exploiting the sparse structure 
of the gradient £euciidean (gradient in the Euclidean space R dlX(i2 ) for the matrix completion 
problem. A general update for SVP can be written as 

U+B+V^ = SVD r (UBV T - C cuc l idean ), 

where SVD r (-) extracts r dominant singular values and singular vectors. An intrinsic limi- 
tation of the approach is that the computational cost of the algorithm is conditioned on the 
particular structure of the gradient. For instance, efficient routines exist for modifying the 
SVD with sparse [Lar98| or low-rank updates |Bra06| . 

Figure [11] shows the competitiveness of the proposed framework of polar factorization 
model with the SVP algorithm. The test example is an incomplete rank 5 matrix of size 
32000 x 32000 with OS = 8. We could not compare the performance of the OptSpace algo- 
rithm as some MATLAB operations (in the code supplied by the authors) have not been opti- 
mized for large matrices. We have, however, observed the good performance of the OptSpace 
algorithm on smaller problems. 

5.4 Euclidean embeddings 

Another view point of the set of fixed-rank matrices is the embedded submanifold view. The 
search space F{r, d\, cfe) is the set of r — rank R dlXc(2 matrices. Recent papers |VanlHlSWC10| 
investigate the search space in detail and develop both first-order and second-order algorithms. 
While conceptually the iterates move on the embedded space, numerically the implementation 
is done using factorization models, full-rank factorization [SWC10J and singular value decom- 
position [Vanll] . We show the characterization of the tangent space using the factorization 
model W £ T with singular value decomposition W = USV r where S is diagonal with 




Figure 11: Illustration of the trust-region and gradient descent algorithms on low-rank matrix 
completion problem for polar factorization. The step-sizes for gradient descent algorithms are 
computed using the Armijo rule. For our gradient descent implementation we use (|13p for 
an initial step-size guess. On the other hand, SVP uses an initial guess pTrppgj with 5=1/3 
in [JMD10J. The main computational burden for SVP comes from computing the dominant 
singular value decomposition which is absent in the quotient geometry. 

positive entries [Vanllj and U and V are orthogonal matrices. The treatment is similar for 
the factorization W = GH T as the underlying geometry is the same fSWClOj . 
The tangent space expression at W = USV T is given by 



T W 7={UMV J +U P V J +UV;|M€lT xr ,Up€ir iXr ,Vp€llr 2Xr } (46) 

where Up" U = and VjV = 0. The set T is turned into a Riemannian submanifold of 
M. dlXd-2 with the Riemannian metric 

gw(£,v)=T±(£ T v) (47) 

which is the Euclidean inner product in R d ^ d 2 now restricted to the set T . £ and rj are 
tangent vectors. An arbitrary vector Z € R dlXd2 is projected on the tangent space Tw-T 7 is 
given using the operator IIw by computing 

M = UZV 

U p = ZV-UM (48) 
Vp = Z T U - VM T . 



A retraction is given by the formula 

i?w(C) = [U Up] 



Ir 



[V Vp] T . (49) 



Note that this can be computed efficiently due to singular value decomposition. The corre- 
sponding expressions for Riemannian gradient grad w /(W) for a smooth function / : T — >■ 
R : W /(W) is given by 

grad w / = n w (Grad w /) (50) 



where Gradw/ is the gradient of / in the Euclidean space M dlXd2 at W. Similarly, the 
expression of Hess/(W)[£] in the direction £ € Pw-P at W = USV T is also developed and 
specifically, for the matrix completion problem ([2]) is [VanllJ 

Hess/(X)[£] = Pu^(0^v + J P T i(^(0 + ^(W-W*)V ? ,5]- 1 V T )Pv (51) 
+Pu(7 3 n(C)+US- 1 UjPn(W-W*))P^ 

where W* is the matrix with partially revealed entreis, £ = UMV T + U p V T + UV^ £ Pw-P 
and the projection operators are P\j = UU T , P^ = I — Pu> Pv = VV T and P^ = I — Pv- 

Discussion 

The visualization of the search space as an embedded submanifold of R dlXd2 has some key 
advantages. For example, the notions of geometric objects can be interpreted in a straight 
forward way. In the matrix completion case, this allows to compute the initial guess for the 
Armijo line-search efficiently [Vanll] . 

On the other hand, the product space representation of the total space seems naturally re- 
lated to the matrix factorization and provides additional flexibility in the choice of the metric. 
It is only the projection to the horizontal space that couples the separated choice of a metric 
on each component of the product space. From the optimization point of view this flexibility 
is also of interest. For instance it allows for different regularization of the matrix factors. The 
numerical comparisons in Figure [12] suggest that quotient geometries are competitive. The 
acronyms Loreta and LRGeom stand for the algorithms proposed in SWCIO] and |Vanllj 
respectively. Loreta is a gradient descent algorithm based on the factorization W = GH T . 
LRGeom has both first-order and trust-region implementations [Vanll] . 

6 Conclusion 

We have addressed the problem of rank-constrained optimization (pQ) and presented both 
first-order and second-order schemes. The proposed framework is general and encompasses 
recent advances in optimization algorithms. In particular, we have studied various well-known 
algorithms for rank-constrained optimization and shown that most of these algorithms are 
based on three factorization models. Each factorization model offers its own advantages 
and is viewed as the product space of smooth and well-studied manifolds. This results in 
interesting invariance properties in the search space. We equip the product space with Rie- 
mannian geometry and develop necessary tools to perform optimization. Geometry related 
operations are linear in the number of rows (or columns) and exact expressions have been 
developed for all three factorization models. Additionally, the product structure enables us 
to use different metrics (subject to the constraint of respecting invariance) on the search 
space. We have shown one practical usefulness of choosing a proper metric in the context 
of subspace-projection factorization in Section 15.11 The usefulness of smooth search space 
has been discussed in the context of polar factorization in Section [5.31 where the flexibility of 
B to be positive definite results in good convergence properties. Similarly, the advantage of 




Figure 12: Low-rank matrix completion of size 32000 x 32000 of rank 5 with OS = 8. Both 
quotient and embedded geometries behave similarly. Top: Gradient descent algorithms. All 
algorithms are initialized similarly. The step-size for all algorithms is computed using Armijo 
rule along with adaptive step-size update (fT3|h For LRGeom the step-size guess suggested in 
[Va nllj is not used as it leads to a higher computational overload on the LRGeom algorithm 
than the adaptive step-size update. Below: Trust-region algorithms. The trust-region sub- 
problem is solved using GenRTR |BAG07| . In many instances LRGeom has a better rate of 
convergence. 

balancing an update has been discussed in Section 15.21 We have illustrated the usefulness of 
our framework on the application of low-rank matrix completion where we compete with the 
state-of-the-art algorithms on various benchmarks while maintaining a complete generality 
(with smoothness) of the cost function. 

So, which matrix factorization to use? This depends primarily on the application. If 
computational complexity (per iteration) is a criterion, then algorithms based on W = GH T 
have a slight advantage over W = UZ T and W = UBV T , in this order, owing to a cheaper 
retraction and projection operators in the space of full-rank matrices B^ xr than in St(r, d). 
Apart from this, the mentioned matrix factorizations also differ in one other aspect, namely, 
invariance in the space of M. dlXd2 . Simply put, given a matrix W £ F{r, di,^), what are 
the invariant group actions on W for each of the factorization geometry. The factorization 
W = UBV T captures the largest class of transformations that can be made invariant and 
W = GH T captures the least and W = UZ T places itself in between. Tuning the geometry 



and the metric to particular problems will be a topic of future research. 
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